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We propose a method for constructing p-values for general hypotheses in a high-dimensional 
linear model. The hypotheses can be local for testing a single regression parameter or they may 
be more global involving several up to all parameters. Furthermore, when considering many 
hypotheses, we show how to adjust for multiple testing taking dependence among the p-values 
into account. Our technique is based on Ridge estimation with an additional correction term due 
to a substantial projection bias in high dimensions. We prove strong error control for our p-values 
and provide sufficient conditions for detection: for the former, we do not make any assumption on 
the size of the true underlying regression coefficients. We demonstrate the method in simulated 
examples and a real data application. 
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1. Introduction 

Many data problems nowadays carry the structure that the number p of covariables may 
greatly exceed sample size n, i.e., p 3> f^. In such a setting, a huge amount of work 
has been pursued addressing prediction of a new response variable, estimation of an 
underlying parameter-vector and variable selection, see for example the books by Hastie 
et al. (2009), Biihlmann and van de Geer (2011) or the more specific review article by 
Fan and Lv (2010). With a few exceptions, see Section 1.3.1, the proposed methods and 
presented mathematical theory do not address the problem of assigning uncertainties, 
statistical significance or confidence: thus, the part of statistical hypothesis testing and 
construction of confidence intervals is largely unexplored and underdeveloped. Yet, such 
significance or confidence measure are crucial in applications where interpretation of 
parameters and variables is very important. The focus of this paper is the construction 
of p-values and corresponding multiple testing adjustment for a high-dimensional linear 
model which is often very useful in p 3> settings: 



Y = X/S'J + £, 



(1.1) 



2 



P. Biihlmann 



where Y = (Yi, . . . , Yn)'^ , X is a fixed design nxp design matrix, is the true underlying 
p X 1 parameter vector and e is the n x 1 stochastic error vector with £i, . . . ,e„ i.i.d. 
having E[ei] = and Var(ej) = cr^ < oo; throughout the paper, p may be much larger n. 
We are interested in testing one or many null-hypotheses of the form: 

Ho,g; /3° = for all j e G, (1.2) 

where G C {l,...,p} is a subset of all the indices of the covariables. Of substantial 
interest is the case where G = {j} corresponding to a hypothesis for the individual jth 
regression parameter (j = 1, . . . ,p). At the other end of the spectrum is the global null- 
hypothesis where G ^ {I, . . . ,p}, and we allow for any G between an individual and the 
global hypothesis. 

1.1. Past work about high-dimensional linear models 

We review in this section an important stream of research for high-dimcnsional linear 
models. The more familiar reader may skip Section 1.1. 

1.1.1. The Lasso 

The Lasso (Tibshirani, 1996) 

/^Lasso = /3Lasso(A) = argmin^dlY - X/3|l2/n + A|l/3|li), 

has become tremendously popular for estimation in high-dimensional linear models. The 
three main themes which have been considered in the past are prediction of the regression 
surface (and for a new response variable) with corresponding measure of accuracy 

||X(/3La«so-/3°)||^/n, (1.3) 

estimation of the parameter vector whose quality is judged by 

||/3Lasso-/3"||.; (ge{l,2}), (1.4) 

and variable selection or estimating the support of denoted by the active set Sq = 
{j; /3jV0, i = l,...,p} such that 

P[5 = ^o] (1-5) 

is large for a selection (estimation) procedure S. 

Greenshtein and Ritov (2004) proved the first result closely related to prediction as 
measured in (1.3). Without any conditions on the deterministic design matrix X one has 
with high probability at least 1 — 2 cxp(— f;^/2): 

||X(^La.so(A)-/3°)||^/n<3/2A|l/3"||i, 

A = 4../^i+^, (1.6) 
V n 
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see Biihlmann and van de Geer (2011, Cor. 6.1). Thereby, we assume Gaussian errors but 
such an assumption can be relaxed (Biihlmann and van de Geer, 2011, formula (6.5)). 
From an asymptotic point of view (where p and n diverge to oo), the regularization 
parameter A x y^\og{p) jn leads to consistency for prediction if the truth is sparse with 
respect to the £i-norm such that ||/3°||i = o(A^^) = o{\/n/ \og{p)). The convergence rate 
is then at best Op(A) = O p{^J\og{p) / n) . 

Such a slow rate of convergence can be improved under additional assumptions on 
the design matrix X. The ill-posedness of the design matrix can be quantified using the 
concept of "modified" eigenvalues. Consider the matrix E = n~^X-^X. The smallest 
eigenvalue of S is 

Amin(S) = nun/3^E/3. 

Of course, Amin(S) equals zero lip > n. Instead of taking the minimum on the right-hand- 
side over all p X 1 vectors /3, we replace it by a constrained minimum, typically over a 
cone. This leads to the concept of restricted eigenvalues (Bickcl ct al., 2009; Koltchinskii, 
2009a, b; Raskutti et al., 2010) or weaker forms such as the compatibility constants (van 
dc Geer, 2007) or further slight weakening of the latter (Sun and Zhang, 2011). Relations 
among the different conditions and "modified" eigenvalues are discussed in van de Geer 
and Biihlmann (2009) and Biihlmann and van de Geer (2011, Ch.6.13). Assuming that 
the smallest "modified" eigenvalue is larger than zero, one can derive an oracle inequality 
of the following prototype: with probability at least 1 — 2exp(— 1^/2) and using A as in 

(1.6) : 

l!X(/3Lasso(A)-/3°)||2/n-fA||/3Lasso-/3°||i <4A2so/(/.g, (1.7) 

where 4>o is the compatibility constant (smallest "modified" eigenvalue) of the fixed design 
matrix X (Biihlmann and van de Geer, 2011, Cor. 6. 2). Again, this holds by assuming 
Gaussian errors but the result can be extended to non-Gaussian distributions. From 

(1.7) , we have two immediate implications: from an asymptotic point of view, using 
A X y^log{p)/n and assuming that cpa is bounded away from 0, 

||X(/3Lasso(A) - I3")\\l/n = O p{so\og{p )/n), (1.8) 
||/3Las.o(A) - = Op (so Vlog(p)/n), (1.9) 

i.e., a fast convergence rate for prediction as in (1.8) and an ^i-norm bound for the esti- 
mation error. We note that the oracle convergence rate, where an oracle would know the 
active set Sq, is Op{so/n): the log(p)-factor is the price to pay by not knowing the active 
set 5*0. An f2-norm bound can be derived as well: ||/3Lasso(A) — (3°\\2 = Op{yJsQ \og{p)/n) 
assuming a slightly stronger restricted eigenvalue condition. Results along these lines 
have been established by Bunea et al. (2007), van de Geer (2008) who covers generalized 
linear models as well, Zhang and Huang (2008), Meinshauscn and Yu (2009), Bickcl et al. 
(2009) among others. 
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The Lasso is doing variable selection: a simple estimator of the active set 5*0 is 
'S'Lasso(A) = {j; /3Lasso;j(A) ^ 0}. In order that 5'Lasso(A) has a good accuracy for So, 
we have to require that the non-zero regression coefficients are sufficiently large (since 
otherwise, we cannot detect the variables in So with high probability). We make a "beta- 
min" assumption whose asymptotic form reads as 

min |/3°| » ^solog{p)/n. (1.10) 

Furthermore, when making a restrictive assumption for the design, called neighborhood 
stability, or assuming the equivalent irrepresentable condition, and choosing a suitable 
A> Vlog(p)/n: 

P[^Lasso(A) = 5o] ^ 1, 

see Meinshausen and Biihlmann (2006), Zhao and Yu (2006), and Wainwright (2009) 
establishes exact scaling results. The "beta-min" assumption in (1.10) as well as the 
irrepresentable condition on the design are restrictive and non-checkable. Furthermore, 
these conditions are essentially necessary (Meinshausen and Biihlmann, 2006; Zhao and 
Yu, 2006). Thus, under weaker assumptions, we can only derive a weaker yet useful result 
about variable screening. Assuming a restricted eigenvalue condition on the fixed design 
X and assuming the "beta-min" condition in (1.10) we still have asymptotically that for 
A X V'log(p)/"- 

P[5'(A) C 5*0] ^ 1 (n^ oo). (1.11) 

The cardinality of the estimated active set (typically) satisfies |'5'(A)| < min(n,p): thus 
if p 3> we achieve a massive and often useful dimensionality reduction in the original 
covariates. 

We summarize that a slow convergence rate for prediction "always" holds. Assuming 
some "constrained minimal eigenvalue" condition on the fixed design X, we obtain the 
fast convergence rate in (1.8), and an estimation error bound as in (1.9); with the addi- 
tional "beta-min" assumption, we obtain the practically useful variable screening prop- 
erty in (1.11). For consistent variable selection, we necessarily need a (much) stronger 
condition on the fixed design, and such a strong condition is questionable to be true in a 
practical problem. Hence variable selection might be a too ambitious goal with the Lasso. 
That is why the original translation of Lasso (Least Absolute Shrinkage and Selection 
Operator) may be better re-translated as Least Absolute Shrinkage and Screening Op- 
erator. We refer to Biihlmann and van dc Geer (2011) for an extensive treatment of the 
properties of the Lasso. 

1.1.2. Other methods 

Of course, the three main inference tasks in a high-dimensional linear model, as described 
by (1.3), (1.4) and (1.5), can be pursued with other methods than the Lasso. 

An interesting line of proposals include concave penalty functions instead of the £i- 
norm in the Lasso (Fan and Li, 2001), and further developed by Zhang (2010). The 
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adaptive Lasso (Zou, 2006), analyzed in the high-dimensional setting by Huang et al. 
(2008) and van de Geer ct al. (2011), can be interpreted as an approximation of some 
concave penalization approach (Zou and Li, 2008). A related procedure to the adaptive 
Lasso is the relaxed Lasso (Meinshausen, 2007). Another method is the Dantzig selector 
(Candcs and Tao, 2007) which has similar statistical properties as the Lasso (Bickel et al., 
2009). Other algorithms include orthogonal matching pursuit (which is essentially forward 
variable selection) or L2 Boosting (matching pursuit) which have desirable properties 
(Tropp, 2004; Biihlmann, 2006). 

Quite different from estimation of the high-dimensional parameter vector are variable 
screening procedures which aim for an analogous property as in (1.11). Prominent exam- 
ples include the SIS-method (Fan and Lv, 2008) and high-dimensional variable screening 
or selection properties have been established for forward variable selection (Wang, 2009) 
and for the PC-algorithm (Biihlmann et al., 2010). 

1.2. Assigning uncertainties and p-values for high-dimensional 
regression 

At the core of statistical inference is the specification of statistical uncertainties, signif- 
icance and confidence. For example, instead of having a variable selection result where 
the probability in (1.5) is large, we would like to have measures controlling a type I error 
(false positive selections), including p-values which are adjusted for large-scale multiple 
testing, or construction of confidence intervals or regions. In the high-dimensional setting, 
answers to these core goals are challenging. 

Meinshausen and Biihlmann (2010) propose Stability Selection, a very generic method 
which is able to control the expected number of false positive selections: that is, denoting 
by V = \S O Sq\, Stability Selection yields a finite-sample upper bound of E[V] (not 
only for linear models but for many other inference problems). To achieve this, a very 
restrictive (but presumably non-necessary) exchangeability condition is made which, in a 
linear model, is implied by a restrictive assumption for the design matrix. On the positive 
side, there is no requirement of a "beta-min" condition as in (1.10) and the method seems 
to give reliable control of E[V^]. 

Wasserman and Roeder (2009) propose a procedure for variable selection based on 
sample splitting. Using their idea and extending it to multiple sample splitting, Mein- 
shausen et al. (2009) develop a much more stable method for construction of p-values 
for hypotheses Hq j : /3j — Q {j ^ 1, . . . ,p) and for adjusting them in a non-naive way 
for multiple testing over p (dependent) tests. The main drawback of this procedure is 
its required "beta-min" assumption in (1.10). And this is very undesirable since for sta- 
tistical hypothesis testing, the test should control type I error regardless of the size of 
the coefficients, while the power of the test should be large if the absolute value of the 
coefficient would be large: thus, we should avoid assuming (1.10). 

Up to now, for the high-dimensional linear model case with p ^ n, it seems that only 
Zhang and Zhang (2011) managed to construct a procedure which leads to statistical 
tests for Hqj, and even to confidence intervals for the true underlying parameters /3^, 
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without assuming a "beta-min" condition. 

1.3. A loose description of our new results 

We start with considering Ridge regression for estimating the high-dimensional regression 
parameter. We then develop a bias correction, addressing the issue that Ridge regression 
is estimating the regression coefficient vector projected to the row space of the design 
matrix: the corrected estimator is denoted by /3corr- 

Theorem 1 describes that under the null-hypothesis, the distribution of a suitably 
normalized a„_p |/3corr| can be stochastically (componentwise) upper-bounded: 

as 

a„,p|/3corr| ^ {\Zj\ + A,)P^,, 

{Zi,...,Zp)r^MpiO,a^n-'n), (1.12) 

for some known positive definite matrix and some known constants Aj . This is the key 
to derive p-values based on this stochastic upper bound. It can be used for construction 
of p-values for individual hypotheses Ho j as well as for more global hypotheses Hq c 
for any subset G C {l,...,p}, including cases where G is (very) large. Furthermore, 
Theorem 2 justifies a simple approach for controlling the family wise error rate when 
considering multiple testing of regression hypotheses. Our multiple testing adjustment 
method itself is closely related to the Westfall- Young permutation procedure (Wcstfall 
and Young, 1993) and hence, it offers high power, especially in presence of dependence 
among the many test-statistics (Mcinshauscn ct al., 2011). 

1.3.1. Relation to other work 

Our new method as well as the approach in Zhang and Zhang (2011) provide p-values (and 
the latter also confidence intervals) without assuming a "beta-min" condition. Both of 
them build on using linear estimators and a correction using a non-linear initial estimator 
such as the Lasso. Using e.g. the Lasso directly leads to the problem of characterizing 
the distribution of the estimator (in a tractable form): this seems very difficult in high- 
dimensional settings while it has been worked out for low-dimensional problems (Knight 
and Fu, 2000). The work by Zhang and Zhang (2011) is the only one which studies 
(sufficiently closely) related questions and goals. 

The approach by Zhang and Zhang (2011) is based on the idea of projecting the high- 
dimensional parameter vector to low-dimensional components, as occurring naturally in 
the hypotheses Hq j about single components, and then proceeding with a linear estima- 
tor. This idea is pursued with the "efficient score function" approach from semiparametric 
statistics (Bickel et al., 1998). The difficulty in the high-dimensional setting is the con- 
struction of the score vector Zj from which one can derive a confidence interval for /3^: 
Zhang and Zhang (2011) propose it as the residual vector from the Lasso when regress- 
ing X^-'^ against all other variables X^^-*-* (where X^"^-* denotes the columns of the design 
matrix corresponding to index set JC{l,...,p}). They then prove the asymptotic va- 
lidity of confidence intervals for finite, sparse linear combinations of . The difference 
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to our work is primarily a rather different construction of the projection where we make 
use of Ridge estimation with a very simple choice of regularization. Furthermore, our 
approach and analysis allows for the construction of p-values of hypotheses Hq q where 
the cardinality of G may be large, and we develop a simple method for multiple testing 
adjustment to control the familywise error rate. On the other hand, we only focus on 
hypotheses testing and we do not address the issue of constructing confidence intervals. 

2. Model, estimation and p-values 

Consider one or many null- hypotheses as in (1.2). We are interested in constructing p- 
values for hypotheses Hq c without imposing a "beta-min" condition as in (1.10): the 
statistical test itself will distinguish whether a regression coefficient is small or not. 

2.1. Identifiability 

We consider model (1.1) with fixed design. Without making additional assumptions on 
the design matrix X, there is a problem of identifiability. Clearly, ii p > n and hence 
rank(X) < n < p, there are different parameter vectors 9 such that X/3" — J^9. Thus, 
we cannot identify /?" from the distribution of Yi, . . . , y„ (and fixed design X). 

Shao and Deng (2011) give a characterization of identifiability in a high-dimensional 
linear model (1.1) with fixed design. Following their approach, it is useful to consider the 
singular value decomposition 

X = RSV^, 

R n X n matrix with U'^U = In: 

S n X n diagonal matrix with singular values si, . . . , s„, 
V p X n matrix with V'^V — In- 

Denote by 7?.(X) C W' the linear space generated by the n rows of X. The projection of 
W onto 7^(X) is then 

Px = X^(XX^)-X = VV'^: 

where A~ denotes the pseudo-inverse of a squared matrix A. 

A natural choice of a parameter 6^ such that X/?" = X0° is the projection of /3° onto 
7^(X). Thus, 

6*° = Fx/3° = VV'^P°- (2.1) 
Then, of course, ^° G 7^(X) if and only if = 6*°. 
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2.2. Ridge regression 

Consider Ridge regression 

P = argmin^llY - X/3||^/n + X\\[3\\l = (n^^X^X + XIp)-^n-^X^Y , (2.2) 

where A = A„ is a regularization parameter. By construction of the estimator, /3 e 7?.(X); 
and indeed, as discussed below, /3 is a reasonable estimator for 9^ — Pxf3^- We denote 

by 

E = n-^X^X. 

The covariance matrix of the Ridge estimator, multiplied by n, is then 

n = n{x) = {t + x,j)-'^t{t + 

a quantity which will appear at many places again. We assume that 

n^in{X):= min {{± + XI)-'±{± + XI)-') > 0. (2.3) 

]€{!,. ..,p} " 

We do not (A) is bounded away from zero as a function of n and p. Thus, 

the assumption in (2.3) is very mild: a rather peculiar design would be needed to violate 
the condition, see also the equivalent formulation in formula (2.4) below. Furthermore, 
(2.3) is easily checkable. 

We denote by X^in^Q{A) the smallest non-zero eigenvalue of a symmetric matrix A. 
We then have the following result. 

Proposition 1. Consider the Ridge regression estimator /3 in (2.2) with regularization 
parameter A > 0. Assume condition (2.3), see also (2.4). Then, 

max |E[/3j] - B°]\ < A||^?"||2A,„i„^o(S)"\ 
iG{i,...,p} 

min Var(/3j) > V^r2niin(A). 
j"e{i,...,p} 

A proof is given in Supplementary Section 8.1, relying in large parts on Shao and 
Deng (2011). We now discuss under which circumstances the estimation bias is smaller 
than the standard error. Qualitatively, this happens if A > is chosen sufficiently small. 
For a more quantitative discussion, we study the behavior of f2inin(A) as a function of 
A and we obtain an equivalent formulation of (2.3). We use the spectral decomposition 
of E = UDU^ with UU'^ = U'^U = Ip and D = diag(i:iii, . . .,Dpp) consisting of the 
ordered eigenvalues Dn < D22 < • • ■ < Dpp of E. Then, 

n = £/diag( . . . , _^^)f/^. 

(I^ii+A)^ [Dpp + Xy 

Denote by q the smallest index such that Dgq > 0, and thus Dgg = Ami„^o(S): if 
rank(X) ~ n < p, then q = p — n + 1. 
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Lemma 1. We have the following: 
1. 



r—q ^ ' 



From this we get: 



(2.3) holds if and only if min max [/? > 0. (2.4) 

l<j<p q<r<p 



2. Assuming (2.3), 



f^mi„(0+) := lim n^in{X) ^ min^ TT^jr > 



3. 



if (2.3) holds: < Lc < lim infr2mi„(A) < Afc < oo, (2.5) 

Ae(o,c] 

for any < C < oo, and where < Lc < Mc < oo are constants which depend on 
C (and on the design matrix X). 

The proof is straightforward using the spectral decomposition given above. From 
Proposition 1 we immediately obtain the following result. 

Corollary 1. Consider the Ridge regression estimator /3 in (2.2) with regularization 
parameter A > satisfying 

Af^„,i„(A)-i/2 < n-'/2„pO\\-ix,^,,^o{t). (2.6) 

In addition, assume condition (2.3), see also (2.4). Then, 

max (E[/3,] - 6l°])2 < min Var(/3,-)- 
i6{i,-,p} ie{i,...,p} 

Due to the third statement in Lemma 1 regarding the behavior of f2miii(A), (2.6) can be 
fulfilled for a sufficiently small value of A. 



2.3. The projection bias and corrected Ridge regression 

As discussed in Section 2.1, Ridge regression is estimating the parameter d'^ — Px/3° 
given in (2.1). Thus, in general, besides the estimation bias governed by the choice of A, 
there is an additional projection bias = 6*° — /3° {j — 1, . . . ,p). Clearly, 

B, = (Px/3°), - = (Px),,/3° - /3° + 5](Px),fc/3fe°. 
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In terms of constructing p-values, controlling type I error for testing Hqj or -ffo.G with 
j G G, the projection bias has only a disturbing effect if /3° = and 9'j ^ 0, and we only 
have to consider the bias under the null-hypothesis: 

Bho;j - (2.7) 
Under Hqj, or i?o,G with j e G, we then have Bh^j = 9^. We can estimate Bho J by 

where /3init is an initial estimator such as the Lasso which guarantees a certain estimation 
accuracy, see assumption (A) below. This motivates the following bias-corrected Ridge 
estimator for testing Hq j, or Hq g with j € G: 

Pcorr-j — Pj ~ -B_ffo;j = Pj ^ ^ (-Px) jfc /3init;fc . (2.8) 

We then have the following representation. 

Proposition 2. Assume model (1.1) with Gaussian errors. Consider the corrected 
Ridge regression estimator /3corr in (2.8) with regularization parameter A > 0, and assume 
(2.3). Then, 

/3corr;j = + 7j (j = 1, ■ • ■ ,P) 

Zi, . . . , Zp - A/'p(0, V^ll), n = f7(A), 

7, = (Px),,/?° - E(^x),fc(A„it;fc - /3^) + fe,(A), 

&,(A)-E[/3,(A)]-(?o. 

A proof is given in Supplementary Section 8.1. The normalizing factors for the variables 
Zj bringing them to the A/'(0, l)-scale are 

a„,p;j(a)-ni/V-il]7//' (j- = l,...,p) 

which are also depending on A through — ri(A). We refer to Section 4.1 where the 
unusually fast divergence of an.p-j{<j) is discussed. The test-statistics we consider are 
simple functions of a„,p;j(cr)/3corr;j. 

2.4. Stochastic bound for the distribution of the corrected Ridge 
estimator: asymptotics 

We provide here an asymptotic stochastic bound for the distribution of dn.py (o')/3corr:j 
under the null-hypothesis. The asymptotic formulation is compact and the basis for the 
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construction of p- values in Section 2.5, but we give more detailed finite-sample results in 
Section 6. 

We consider a triangular array of observations from a linear model as in (1.1): 

Y„ =X„/3°+£„, n = 1,2,..., (2.9) 

where all the quantities and also the dimension p — p„ are allowed to change with n. We 
make the following assumption. 

(A) There are constants Aj — Aj_„ > such that 

P[n^=l{|an.pu('^) 5I(^x)jfc(Anit;fc " /3fc)| < A,, „}] 1 (u ^ C»). 

We will discuss in Section 2.4.1 constructions for such bounds Aj. Our next result is 
the key to obtain a p- value for testing the null-hypothesis Hqj or Hq^q, saying that 
asymptotically, 

as. 

an,pu(0-)/3corr;j ^ |M^|-|-Aj, 

where W ~ A/'(0, 1), and similarly for the multi-dimensional version with /3corr;G (where 
< denotes "stochastically smaller or equal to" ) . 

Theorem 1. Assume model (2.9) with fixed design and Gaussian errors. Consider the 
corrected Ridge regression estimator /3corr in (2-8) with regularization parameter A„ > 
such that 

A„r!n.i„(A„)-i/2 = o(min(n-i/2||0O||-i^^^.^_^^(^))) ^ 

and assume condition (A) and (2.3) (while for the latter, the quantity does not need to 
be bounded away from zero). Then, for j G {1, . . . ,p„} and if Hq j holds: for all u G M, 

limsup (P[a„,p;j(a)|/3corrul >u]~ V[\W\ + A^- > u]) < 0, 

n— f oo 

where W ~ A/'(0, 1). Similarly, for any sequence of subsets {G„}„, G„ C {1, . . . ,p„} and 
if 7Jo,G„ holds: for ah m G M+, 

limsup (P[max a„.p;j ((T)|/3corr:j| > It] — P[max{an.p:j{(7)\Zj\ + Aj) > u]) < 0, 
where Zi, . . . , Zp are as in Proposition 2. 

A proof is given in Supplementary Section 8.1. We note that the distribution of 
maxjgG^(a„_p;j((T)|Zj| -I- A^) does not depend on a and can be easily computed via 
simulation. 
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2.4.1. Bounds Aj in assumption (A) 

We discuss an approach for constructing the bounds Aj . As mentioned above, they should 
not involve any unknown quantities so that we can use them for constructing p-values 
from the distribution of \W\ + Aj or maxjgG^(a„^p;j(cr)|Zj| + Aj), respectively. 
We rely on the (crude) bound 

|a».P;j (^) 'E(^x),fc (Anit;fe - /3fc) I < an,p;, {o) max | (Px)jfc 1 1| A„it - || 1 • (2.10) 
To proceed further, we consider the Lasso as initial estimator. Due to (1.7) we obtain 

l«n,Pj('^) y^(-Fx)jfc(/3init;fc -/3fc)| < max |a„,p;j (ct) (Px)jfc |4ALassoSO0o ^ i (2-11) 

where the last inequality holds on a set with probability at least 1 — 2exp(— i^/2) when 
choosing ALasso as in (1.6). The assumptions we require are summarized next. 

Lemma 2. Consider the linear model (2.9) with fixed design which satisfies the com- 
patibility condition with constant (/)q = (/)q „ . Consider the Lasso as initial estimator Pi^it 
with regularization parameter ALasso = 4(T-y/C log(p„)/n for some 2 < C < c». Assume 
that the sparsity sq — so,„ = o((n/ log(p„))^) (n — > 00) for some < ^ < 1/2, and that 
liminf„^oo 0o,n > 0- Then, 

A, :=max|a„,p;,(a)(Px),fc|(log(p)/n)i/2-? (2.12) 

satisfies assumption (A). 

A proof follows from (2.11). When assuming bounded sparsity So.„ < M < 00 for all 
n, we can choose = with an additional constant M on the right-hand side of (2.12). 
In our practical examples, we use ^ — 0.05. We summarize the results as follows. 

Corollary 2. Assume the conditions of Theorem 1 without condition (A) and the 
conditions of Lemma 2. Then, when using the Lasso as initial estimator, the statements 
in Theorem 1 hold. 

2.5. P-values 

Our construction of p-values is based on the asymptotic distributions in Theorem 1. For 
an individual hypothesis Hqj, we define the p- value for the two-sided alternative as 

Pj = 2(1 - $((a„,p;,(a)|/3eorr;,| " Aj) + )). (2.13) 

Of course, we could also consider one-sided alternatives with the obvious modification 
for Pj. For a more general hypothesis i?o,G with \G\ > 1, we use the maximum as test 
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statistics (but other statistics such as weighted sums could be chosen as weU) and denote 

by 

7G = max an,p-j (cr) |4orr„ I , 

Jg(c) = P[max(a„,py(a)|Zj| + A^-) < c], 

where the latter is independent of a and can be easily computed via simulation (Zi, . . . ^ Zp 
are as in Proposition 2). Then, the p- value for Hq^g, against the alternative being the 
complement Hq q , is defined as 

Pg = 1 - Jg(7g). (2.14) 

We note that when Aj = A is the same for all j, we can rewrite Pq = 1— P[maxjgG o-n,p-j (c) 
\Zj\ < (7g — A)+] which is a direct analogue of (2.13). 

Error control follows immediately by the construction of the p- values. 

Corollary 3. Assume the conditions in Theorem 1. Then, for any < a < 1, 

limsupP[Pj < a] - a < if Hqj holds, 

71— J-OO 

limsupP[PG < a] - a < if Ho,g holds. 

n— f oo 

Furthermore, for any sequence a„ — > (n — > oo) which converges sufficiently slowly, the 
statements also hold when replacing a by a„. 

A discussion about detection power of the method is given in Section 4. Further 
remarks about these p- values are given in Supplementary Section 8.4. 

2.5.1. Estimation of a 

In practice, for the p-values in (2.13) and (2.14), we use the normalizing factor a„^p;j(tT) 
with an estimate a. These p-values are asymptotically controlling the type I error if 
P[(7 > cr] — > 1 (n — > oo). This follows immediately from the construction. 

We propose to use the estimator a from the Scaled Lasso method (Sun and Zhang, 
2011). Assuming SQ\og{p)/n = o(l) (n — > oo) and the compatibility condition for the 
design. Sun and Zhang (2011) prove that \(j /a — 1| = op{l) [n oo). 

3. Multiple testing 

We aim to strongly control the family wise error rate V\V > 0] where V is the number of 
false positive selections. For simplicity, we consider first individual hypotheses Hqj {j S 
{1, . . . ,p}). The generalization to multiple testing of general hypotheses Hq g with \G\ > 
1 is discussed in Section 3.2. 
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Based on the individual p- values Pj, we want to construct corrected p- values Pcorr-j 
corresponding to the following decision rule: 

reject Hqj if Peony < a (0 < a < 1). 

We denote the associated estimated set of rejected hypotheses (the set of significant 
variables) by Sa = {j; Pcorr-j < «}■ Furthermore, recall that = {j; /3° ^ 0} is the 
set of true active variables. The number of false positives using the nominal significance 
level a is the denoted by 

Va — Sa n S'q. 

The goal is to construct Pcorry" such that P[Vq > 0] < a, or that the latter holds at least 
in an asymptotic sense. The method we describe here is closely related to the Westfall- 
Young procedure (Westfall and Young, 1993). 

Consider the variables Zi,...,^^ ^ A/'p(0, cr^n~^il) appearing in Proposition 2 or 
Theorem 1. Consider the following distribution function: 

Fz{c) = P[ min 2(1 - $(a„.p;j(a)|Zj|)) < c]. 
i<i<p 

and define 

Pcorru =Pz(P,+C), (3-1) 

where C > is an arbitrarily small number, e.g. Q = 0.01 for using the method in practice. 
Regarding the choice of C = (which we use in all empirical examples in Section 5, 
see the Remark appearing after Theorem 2 below). The distribution function Fz{-) is 
independent of a and can be easily computed via simulation of the dependent, mean zero 
jointly Gaussian variables Zi, Zp. It is computationally (much) faster than simulation 
of the so-called minP-statistics (Westfall and Young, 1993) which would require fitting 
/3corr many times. 

3.1. Asymptotic justification of the multiple testing procedure 

We first derive family wise error control in an asymptotic sense. For a finite sample result, 
see Section 6. We consider the framework as in (2.9). 

Theorem 2. Assume the conditions in Theorem 1. For the p-value in (2.13) and using 
the correction in (3.1) with C > we have: for < a < 1, 

limsupP[ya > 0] < a. 

n— fcjo 

Furthermore, for any sequence q;„ — (n — )■ oo) which converges sufficiently slowly, it 
holds that limsup„^oo ^[^a„ > 0] - a„ < 0. 

A proof is given in Supplementary Section 8.1. 
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Remark: Multiple testing correction in (3.1) with C = 0- We could modify the 
correction in (3.1) using C = 0: the statement in Theorem 2 can then be derived when 
making the additional assumption that 

sup sup |G'^j^2(u)| < oo, (3.2) 

where Fn,z{-) ~ Fz{ ) is the distribution function appearing in (3.1) which depends in 
the asymptotic framework on n and (mainly on) p = pn- Verifying (3.2) may not be 
easy for general matrices il = ^n,p„ ■ However, for the special case where Zi, . . . ,Zp are 
independent, 

G'Au)=pip{u)il-^u)r-' 
which is nicely bounded as a function of u, over all values of p. 

3.2. Multiple testing of general hypotheses 

The methodology for testing many general hypotheses Hq q. with \Gj\>l,j~l,...,'m 
is the same as before. Denote by Sq,g — {j, Hq^Gj does not hold} and by 5q ^ = 
{j; Hq^q. holds}; note that these sets are determined by the true parameter vector 
/3°. Since the p-value in (2.14) is of the form = 1 — JcjilGj), '^e consider 

Fg,z=P[ min (1 - Jg,(7g,,z)) < c], 7g,z = max(a„,p;j(cr)|Zj|) 

j = l,...,rn jeG 

which can be easily computed via simulation (and it is independent of a). We then define 
the corrected p-value as 

^'corr;Gj = Pg.,z{PGj +C)i 

where C > is a small value such as C = 0.01; see also the definition in (3.1) and the 
corresponding discussion for the case where C = (which now applies to the distribution 
function Fq^z instead of Fz). We denote by 80,0. = {j\ -Pcorr;Gj < «} and Vg,q = 
Sc.a n S^ Q. 

If ^Gj(') has a bounded first derivative, for all j, we can obtain the same result, 
under the same conditions, as in Theorem 2 (and without making a condition on the 
cardinalities of Gj). If Jcji') has not a bounded first derivative, we can get around this 
problem by modifying the p-value Pgj in (2.14) to Pgj = 1 — Jg, {iGj — for any (small) 
V > and proceeding with Pg^ ■ 

4. Sufficient conditions for detection 

We consider detection of alternatives ^ or q with \G\ > 1. We use again the 
notation Sq as in Section 3 and denote by a„ ^ &„ that a„/6„ — >■ 00 (n -> cxd). 
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Theorem 3. Consider the setting and assumptions as in Theorem 1. 
1. When considering individual hypotheses Hq ^: if j G 5*0 with 

|/3°| » a„,p;,(a)-i I (Px)„r 'niax(A„l) 

there exists an q;„ — {n ^ oo) such that 

< ctn] 1 {n ^ oo), 



while we still have for j S S^: limsup„_^gQ ^[Pj ^ ctn] ~ cun < (see Corollary 3). 

2. When considering individual hypotheses Hq c with G ~ G„ and |G„| > 1: if Hq q 
holds, with 

max \an.p-j{a)Pjjl3"\ > max(max \Aj\, \/log(|G„|)), 

there exists an q;„ — >■ (n — ^ oo) such that 

V[Pg„ <an] ^ 1 (n^oo), 

while if Ho,c holds, hmsup„^o^ IP[-Pg„ < ctn] — < (see Corollary 3). 

3. When considering multiple hypotheses Hqj: if for all j E Sq, 

> a„^p;j((T)-i|(Px)„-r'max(A,,yiog(pJ) 
there exists an q;„ — )■ (n — )■ oo) such that 

Pifcorry- < ««] ^ 1 (n Oo) for j € 5*0 

while we still have that limsup„_j.o^ lP[Ki„ > 0] — a„ < (see Theorem 2). 

4. If in addition, an,p-j{cr) — >■ oo for all j appearing in the conditions on /3°, we can 
replace in all the statements 1-3 the relation by "> C" , where < C < oo is 
a sufficiently large constant. 

A proof is given in Supplementary Section 8.1. Under the additional assumption of 
Lemma 2, where the Lasso is used as initial estimator and using the bounds in (2.12), 
we obtain the bound (for statement 1 in Theorem 3): 

i,°i,c„^ C°"^;;^'^-| ('^)-^'-^^...,,w-), (4.1) 

where < ^ < 1/2. This can be sharpened using the oracle bound, assuming known 
order of sparsity: 



Aorac;j = DsQ,n max a„.py (ct) I (Px)jfc I \/log(p„)/n 
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for some D > sufficiently large (for example, assuming sg.n is bounded, and replacing 
so,n by 1 and choosing D > sufficiently large). It then sufhces to require 

|^0| ^ ( "^7;'i^^)^^-' .o,„(^^)^/^ ,.pff^. . ) for 3. in Th. 3, 

(4.2) 

and analogously for the second statement in Theorem 3. 



4.1. Order of magnitude of normalizing factors 

The order of a„^py (CT) is typically much larger than y/n since in high dimensions, fljj 
is very small. This means that Ridge regression f5j has a much faster convergence rate 
than 1/y/n for estimating the projected parameter 0'^. This looks counter-intuitive at 
first sight: the reason for the phenomenon is that ||6'"||2 can be much smaller than |j/3*'||2 
and hence. Ridge regression (which estimates the parameter 0'^) is operating on a much 
smaller scale. This fact is essentially an implication of the first statement in Lemma 
1 (without the "miuj" part) since the eigenvectors are normalized with X]r=i ^jr ~ ^ 
and q is large when p is large. For further discussion about the fast convergence rate 
a„p.j(CT)~^, see Supplementary Section 8.4. 

While a„ j,.j(cr)~^ is usually small, there is compensation with {Px.)jj^ which can be 
rather large. In the detection bound in e.g. the first part of (4.2), both terms appearing 
in the maximum are often of the same order of magnitude; see also Figure 3 in Supple- 
mentary Section 8.4. Assuming such a balance of terms, we obtain in e.g. the first part 
of (4.2): 

The value of Hj — max^^j |(Px)jfe|/|(^'x)jj| is often a rather small number between 
0.05 and 4, see Table 1 in Section 5. For comparison, Zhang and Zhang (2011) establish 
detection for single hypotheses Hqj with /Sj in the i/y/n range (our results in Theorem 
3 also cover general hypotheses i?o.G„ with |G„| potentially large, as well as multiple 
testing). For the extreme case with G„ = {1, . . . we are in the setting of detection 
of the global hypotheses, see for example Ingster et al. (2010) for characterizing the 
detection boundary in case of independent covariables. Our analysis of detection is only 
providing sufficient conditions, for rather general (fixed) design matrices. 



5. Numerical results 



As initial estimator for /3corr in (2.8), we use the Scaled Lasso with scale-independent 
regularization parameter ALasso = 2-\/log(p)/7i: it provides an initial estimate /3init as 



18 



P. Biihlmann 



well as an estimate a for the standard deviation a. The parameter A for Ridge regression 
in (2.2) is always chosen as A = 1/n, reflecting the assumption in Theorem 1 that it 
should be small. 

For single testing, we construct p-values as in (2.13) or (2.14) with Aj from (2.12) 
with ^ = 0.05. For multiple testing with familywise error control, we consider p-values 
as in (3.1) with C = (and A^- as above). 

5.1. Simulations 

We simulate from the linear model as in (1.1) with e ~ A/'„(0,/), n = 100 and the 
following configurations: 

(Ml) For both p e {500, 2500}, the fixed design matrix is generated from a realization 
of n i.i.d. rows from Afp{0,I). Regarding the regression coefficients, we consider 
active sets So = {1,2, ...,So} with sq G {3,15} and three different strengths of 
regression coefficients where (3j=b {j e So) with b e {0.25, 0.5, 1}. 

(M2) The same as in (Ml) but for both p g {500,2500}, the fixed design matrix is 
generated from a realization of n i.i.d. rows from A/^(0, S) with "Ej^ = 0.8 (j 7^ k) 
and Sjj — 1. 

The resulting signal to noise ratios SNR — ||X/3°||2/cr are rather small: 



p e {500, 2500} 


(3,0.25) 


(3,0.5) 


(3,1) 


(15,0.25) 


(15,0.5) 


(15,1) 


(Ml) 


0.46 


0.93 


1.86 


1.06 


2.13 


4.26 


(M2) 


0.65 


1.31 


2.62 


3.18 


6.37 


12.73 



Here, a pair such as (3, 0.25) denotes the values of sq = 3, b — 0.25 (where b is the value 
of the active regression coefficients). 

We consider the decision-rule at significance level a — 0.05 

reject Hqj if Pj < 0.05, (5.1) 

for testing single hypotheses where Pj is as in (2.13) with plugged-in estimate a. The 
considered type I error is the average over non-active variables: 

jess 

and the average power is 

jeSo 

For multiple testing, we consider the adjusted p- value Pcon-j from (3.1): the decision 
is as in (5.1) but replacing Pj by Pcorry- We report the familywise error rate (FWER) 
1P[^0.05 > 0] and the average power as in (5.3) but the latter with using Pcon j - The results 
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single testing (Ml) 



single testing (M2) 



1e-04 0.01 0.1 

average type I error 



1e-04 0.01 0.1 

type I error 



(a) 



(b) 



inultipie testing (M1) 



multiple testing (M2) 



- Q o CP q O 

0.002 0.01 0.1 0.2 



FWER 

(c) 



(d) 



Figure 1. Simulated data as described in Section 5.1. (a) and (b): Single testing with average type I error 
(5.2) on X-axis (log-scale) and average power (5.3) on y-axis. (c) and (d): Multiple testing with familywise 
error rate on x-axis (log-scale) and average power (5.3), but using Pcorr;ji on y-axis. Vertical dotted line 
is at abscissa 0.05. Each point corresponds to a model configuration, (a) and (c): 12 model configurations 
generated from independent covariates (Ml); (b) and (d): 12 model configurations generated from oqui- 
dependent covariates (M2). When an error is zero, we plot it on the log-scale at abscissa 10~*. 



are displayed in Figure 1, based on 500 simulation runs per setting. The subfigures (c) 
and (d) show that the proposed method sometimes exhibits a too large familywise error 
rate in multiple testing. However, the corresponding number of false positives are still 
small except for the most extreme model configuration, see Table 3 in Supplementary 
Section 8.3. 
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5.2. Values of Px 

The detection results in (4.1) and (4.2) depend on the ratio Kj ~ maxk^j |(Px)jfc|/|(^x)jj|- 
We report in Table 1 summary statistics of {Kj}j for various datasets. We clearly see 



dataset, (n,p) 


miuj Kj 


0.25-q{Kj}j 


med{Kj}j 


0.75-q{Kj}_, 




(Ml), (100,500) 


0.21 


0.27 


0.29 


0.31 


0.44 


(Ml), (100,2500) 


0.27 


0.34 


0.36 


0.39 


0.54 


(M2), (100,500) 


0.20 


0.26 


0.29 


0.32 


0.45 


(M2), (100,2500) 


0.26 


0.33 


0.36 


0.39 


0.59 


Motif, (143,287) 


0.05 


0.10 


0.13 


0.18 


0.47 


Ribofiavin, (71,4088) 


0.29 


0.54 


0.65 


0.77 


1.73 


Leukemia, (72, 3571) 


0.32 


0.44 


0.50 


0.58 


1.57 


Colon, (62,2000) 


0.28 


0.50 


0.57 


0.67 


1.36 


Lymphoma, (62,4026) 


0.34 


0.52 


0.63 


0.78 


1.49 


Brain, (34, 5893) 


0.51 


0.63 


0.67 


0.74 


2.44 


Prostate, (102,6033) 


0.26 


0.45 


0.57 


0.74 


3.67 


NCI, (61,5244) 


0.37 


0.52 


0.61 


0.79 


1.76 



Table 1. Minimum, maximum and three quartiles of for various designs X from different 

datasets. The first four are from the simulation models in Section 5.1. Although not relevant for the 
table, "Motif" (see Section 5.3) and "Riboflavin" have a continuous response while the last six have a 

class label (Dcttling, 2004). 

that the values of Hj are typically rather small which implies good detection proper- 
ties as discussed in Section 4. Furthermore, the values maxk^j |(Px)jfc| occurring in the 
construction of Aj in Section 2.4.1 are typically very small (not shown here). 

5.3. Real data application 

We consider a problem about motif regression for finding the binding sites in DNA 
sequences of the HIFla transcription factor. The binding sites are also called motifs, and 
they are typically 6-15 base pairs (with categorical values € {A,C,G,T}) long. 

The data consists of a univariate response variable Y from CHIP-chip experiments, 
measuring the logarithm of the binding intensity of the HIFla transcription factor on 
coarse DNA segments. Furthermore, for each DNA segment, we have abundance scores 
for p = 195 candidate motifs, based on DNA sequence data. Thus, for each DNA segment 
i we have € R and Xj G M^, where i = 1, . . . , ntot = 287 and p = 195. We consider a 
linear model as in (1.1) and hypotheses Hq j for j = 1, . . . ,p — 195: rejection of j then 
corresponds to a significant motif. This dataset has been analyzed in Meinshausen et al. 
(2009) who found one significant motif using their p-value method for a linear model 
based on multiple sample splitting (which assumes the unpleasant "beta-min" condition 
in (1.10)). 

Since the dataset has ntot > P observations, we take one random subsample of size 
n = 143 < p — 195. Figure 2 reports the single-testing as well as the adjusted p- 
values for controlling the FWER. There are two significant motifs with corresponding 
FWER-adjusted p-values equal to 0.009 and 0.023 (whereas the method in Meinshausen 
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et al. (2009) based on the total sample with ntot only found one significant variable 
with FWER-adjusted p-value equal to 0.006; the same variable which achieves p-value 
0.009 with our method). Interestingly, one of the significant motifs (with p-value 0.023) 
is known to be a true binding site for HIFla, thanks to biological validation experiments. 

When compared to the Bonferroni-Holm procedure for controlling FWER based on 
the raw p- values as shown in Figure 2(a), we have for the variables with smallest p- values: 

method as in (3.1): 0.009, 0.023, 0.137, 
Bonferroni-Holm: 0.011, 0.027, 0.176. 

Thus, for this example, the multiple testing correction as in Section 3 does not provide 
large improvements in power over the Bonferroni-Holm procedure; but our method is 
closely related to the Westfall- Young procedure which has been shown to be asymptoti- 
cally optimal for a broad class of high-dimensional problems (Meinshausen et al., 2011). 



motif regression: single testing 



motif regression: multiple testing 



0) 

ro o 

> C\J - 

Q. O 
■D 

0) o 

o 

■D 

f in 



IT 



50 



100 
variables 



150 



200 



50 



100 

variables 



150 



200 



(a) 



(b) 



Figure 2. Motif regression witli n = 143 and p = 195. (a) Single-testing p-values as in (2.13); (b) 
Adjusted p-values as in (3.1) for FWER control. The p-values are plotted on the log-scale. Horizontal 
line is at y = 0.05. 



6. Finite sample results 

We present here finite sample analogues of Theorem 1 and 2. Instead of assumption (A), 
we assume the following: 
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(A') There are constants Aj > such that 

n n, = l {a„,p;,(a) ^(Px),fe(Anit;fc -P'k)\< A,}] >l-K 

for some (smaU) < k < 1. 
We then have the following result. 

Proposition 3. Assume model (1.1) with Gaussian errors. Consider the corrected 
Ridge regression estimator /3corr in (2.8) with regularization parameter A > 0, and assume 
(2.3) and condition (A'). Then, with probability at least 1 — k, for j e {1, . . . ,p} and if 
Hqj holds: 

corr;j | S On.p; 

j{cr)\Zj\ + Aj + ||a„^p6(A)||oo, 
||a«,p&(A)||oo = max a„^py (a)|5,(A)| < ^^—j^n^/^cT-^\\e%X^i^^oi^)-\ 

Similarly, with probability at least 1 — k, for any subset G C {1, . . . , p} and if -ffo,G holds: 
rn£K;a„,p;j(cr)|/3corr;j I < (a„,py ((t)|Zj| + A^) + ||a„^p6(A)||oo- 

A proof is given in Supplementary Section 8.1. Due to the third statement in Lemma 
Ij f^min(A)^^/^ is bounded for a bounded range of A G (0, C]. Therefore, the bound for 
||'in,p^(A)||oo can be made arbitrarily small by choosing A > sufficiently small. 

Theorem 2 is a consequence of the following finite sample result. 

Proposition 4. Consider the event 8 with probability V[8] > 1 — k where condition 
(A') holds. Then, when using the corrected p-values from (3.1), with C > (allowing also 
C = 0), we obtain approximate strong control of the familywise error rate: 

P[K > 0] < Fz{Fz\a) - C + 2(27r)-i/2||a„^p6(A)||oo) + (1 - P[f ])• 

A proof is given in Supplementary Section 8.1. We immediately get the following 
bound for C > 0: 

P[T4 > 0] < a + sup \G'z{u)\2{2TT)-^'''\\an,ph{X)\U + (1 - P[f])- 

U 

7. Conclusions 

We have proposed a novel construction of p-values for individual and more general hy- 
potheses in a high-dimensional linear model with fixed design and Gaussian errors. We 
have restricted ourselves to max-type statistics for general hypotheses but modification 
to e.g. weighted sums is straightforward using the representation in Proposition 2. A key 
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idea is to use a linear, namely the Ridge estimator, combined with a correction for the 
potentially substantial bias due to the fact that the Ridge estimator is estimating the 
projected regression parameter vector onto the row-space of the design matrix. The fact 
that we can succeed with a corrected Ridge estimator in a high-dimensional context may 
come as a surprise, as it is well known that Ridge estimation can be very bad for say 
prediction: for more explanations see below. A related idea of using a linear estimator 
coupled with a bias correction for deriving confidence intervals has been earlier proposed 
by Zhang and Zhang (2011). 

No tuning parameter. Our approach approach does not require the specification of 
a tuning parameter, except for the issue that we crudely bound the true sparsity as 
in (2.12); we always used ^ = 0.05, and the Scaled Lasso initial estimator does not 
require the specification of a regularization parameter. All our numerical examples were 
run without tuning the method to a specific setting, and error control with our p-value 
approach is often slightly conservative while the power seems reasonable. Regarding the 
latter, construction of optimal tests for Hq j or more general Hq g in a high-dimensional 
linear model with dependent design is an open problem. Furthermore, our method is 
generic which allows to test for any Hq.g regardless whether the size of G is small or 
large: we present in the Supplementary Section 8.2 an additional simulation where \G\ is 
large. For multiple testing correction or for general hypotheses with sets G where \G\ > 1, 
we rely on the power of simulation since analytical formulae for max-type statistics under 
dependence seem in-existing: yet, our simulation is extremely simple as we only need to 
generate dependent multivariate Gaussian random variables. 

Small variance of Ridge estimator. As indicated before, it is surprising that corrected 
Ridge estimation performs rather well for statistical testing. Although the bias due to 
the projection Px can be substantial, it is compensated by small variances a'^n~^rtjj of 
the Ridge estimator. It is not true that ^jj's become large as p increases: that is, the 
Ridge estimator has small variance for an individual component when p is very large, 
see Section 4.1. Therefore, the detection power of the method remains good as discussed 
in Section 4. Viewed from a different perspective, even though \{PK)jjPj \ may be very 
small, the normalized version an^p-j{cr)\{Px)jjf3j \ can be sufficiently large for detection 
since an,p-j{(j) may be very large (as the inverse of the square root of the variance). The 
values of Px can be easily computed for a given problem: our analysis about sufficient 
conditions for detection in Section 4 could be made more complete by invoking random 
matrix theory for the projection Px (assuming that X is a realization of i.i.d. row- 
vectors whose entries are potentially dependent). However, currently, most of the results 
on singular values and similar quantities of X are for the regime p < n (Vcrsliynin, 2012), 
which leads in our context to the trivial projection Px = /, or for the regime p/n — > C 
with < C < oo (El Karoui, 2008). 

Extensions. Obvious but partially non-trivial model extensions include random design, 
non-Gaussian errors or generalized linear models. From a practical point of view, the 
second and third issue would be most valuable. Relaxing the fixed design assumption 
makes part of the mathematical arguments more complicated, yet a random design is 
better posed in terms of identifiability. 
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8. Supplementary Section 
8.1. Proofs 

Proof of Proposition 1. 

The statement about the bias is given in Shao and Deng (2011) (proof of their Theorem 
1). The covariance matrix of j3 is 

Then, for the variance we obtain Var(/3j) = n^^a^iljj > n~^(T^f2niin(A). □ 

Proof of Proposition 2. 
We write 

Pcorr-j = 0j - E[/3,]) + - 5I(^x),fcAnit;fe + (E[/3j] " 0^). 

The result then follows by defining Z.j = /3j - E[/3j] and using that 9'^ = {Py^l3^)j = 
(^x)«/3° + Em.(^x),./3,°. □ 

Proof of Proposition 3 (basis for proving Theorem 1 ). 
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The bound from Proposition 1 for the estimation bias of the Ridge estimator leads to: 

||a„,p6(A)||oo = max an,p;j{a)\E0j - 0°\ 
j=i,....P 

X\\9°\\2XrmnM^y^ 



< 



< A||(?0||2A„„„^o(S)"V-l7ll/2ani„(A)-l/2. 

By using the representation from Proposition 2, invoking assumption (A') and assuming 
that the null-hypothesis Hqj or Hq g holds, respectively, the proof is completed. □ 

Proof of Theorem 1. 

Due to the choice of A = A„ we have that ||a„,p6(A„)||oo = o(l) {n oo). The proof then 
follows from Proposition 3 and invoking assumption (A) saying that the probabilities for 
the statements in Proposition 3 converge to 1 as n — )> cxi. □ 

Proof of Proposition 4 (basis for proving Theorem 2). 

Consider the set £ where assumption (A') holds (whose probability is at least V[£] > 
1 — k). Then, on £ and for j G Sq: 

Pj - 2(l-$(a„^p;j((j)|/3corr;j|-Aj)) 

> - $(a„,py(a)|/3corr;j - (^x) jfc (/3init;fc " 

> 2(1 - $(a„,p;,((T)|Z,|)) - 2(2^)-i/2||a„XA)||oo, 

where in the last inequality we used Proposition 2 and Taylor's expansion. Thus, on £: 
minPj > min2{l-<^>{an,p;j{a)\Zj\)) -2{2TT)-^^^\\an.pb{\)\\^ 

> min 2(l-$(a„.py(CT)|Zj|)) -2(27r)-i/2||a„,p6(A)||eo. 
3=1, ■■■,P 



Therefore, 



'[min Pj <c]< ¥[£ n {min Pj < c] + ¥[£"] 



< P[ min 2(l-$(a„,py(a)|Z,|)) <c + 2(2^)-i/2||a„XA)||oo]+P[f'= 

= Fz(c + 2(2^)-i/2||a„,p6(A)||oo) +P[f1. 
Using this we obtain: 

P[14 > 0] = P[min Pcorr-7 < a] = P[min P, < Fz^{a) - C] 



< 



Fz{Fz\a) - C + 2(27r)-i/2||a„^p6(A)||oo) + P[f 1 
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This completes the proof. □ 
Proof of Theorem 2. 

Due to the choice of A = A„ we have that ||a„^p6(A„)||oo = o(l) (n — > oo). Further- 
more, using the formulation in Proposition 4, assumption (A) translates to a sequence 
of sets £n with P[£'„] — 1 (n — > oo). We then use Proposition 4 and observe that 
for sufficiently large n: Fz{Fz\a) - C + 2(27r)-i/2||a„,p&(A„)||oo) < FziFz\a)) < a. 
The modification for the case with a„ — > sufficiently slowly follows analogously: note 
that the second last inequality in the proof above follows by monotonicity of Fz{-) and 
C > 2(27r)^^/^||a„,p6(A„)||oo for n sufficiently large. This completes the proof. □ 

Proof of Theorem 3. 

Throughout the proof, a„ — is converging sufficiently slowly, possibly depending on 
the context of the different statements we prove. Regarding statement 1: it is sufficient 
that for j (z So, 

an,p;j(o')|^coiT;j| > Aj. 

From Proposition 2 we see that this can be enforced by requiring 

a«,p;,(a)(|(Px)„-/3°| - I 5](Px),fc(A„it;fc - /3fc°)| - \Z,\ - |6,(A)|) » A,. 

Since \an,p-j{'j)Y,k^j{Px.)jk0init;k - Pk)\ ^ Aj, this holds if 

l^il ^ ifp \ ^ r^max(Aj,a„,py(cr)Zj,a„,p;j(cr)fej(A)). (8.1) 

Due to the choice of A = A„ (as in Theorem 1) we have an,p-j{iy)bj{X) < ||a„.p((T6(A)||oo = 
o(l). Hence (8.1) holds with probability converging to one if 

completing the proof for statement 1. 

For proving the second statement, we recall that 

1 - Jg(c) = P[max (a„,py (a) + A^-) > c]. 

Denote by = iiiaxj^GiO'n,p:j{o')\Zj\ + Aj) <W~ maxjgc o,n,p;j{(^)\Zj\ + maxj-gg Aj. 
Thus, 

V[W > c] < V[W > c]. 
Therefore, the statement for the p-value V[Pg < an] is implied by 
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Using the union bound and the fact that cin.p-j{o')\Zj\ ^ A/'(0, 1) (but dependent over 
different values of j), we have that 

maxan,p,j{o-)\Zj\ = Op( Vlog(|G|)). 

Therefore, (8.2) holds if 

7g = maxa„.p;j(cr)|/3corr:j| > max(niax Aj, 0og(|G|)). 

The argument is now analogous to the proof of the first statement above, using the 
representation from Proposition 2. 

Regarding the third statement, we invoke the rough bound 

with the non-truncated Bonferroni corrected p-value at the right-hand side. Hence, 

max7^corr:7 — 

jeSo 

is implied by 

maxpP, = max2p(l - $((a„ „.,(CT)|/3corr:j I - < a„. 

jeSo jeSo 

Since this is a standard Gaussian two-sided tail probability, this can be enforced (for 
certain slowly converging a„) by 

max2exp (log(p) - (a„_p.j(cr)|/3corr;j I - Aj)^/2) = op(l). 

The argument is now analogous to the proof of the first statement above, using the 
representation from Proposition 2. 

The fourth statement involves slight obvious modifications of the arguments above. □ 

8.2. P-values for i?o,G with \G\ large 

We report here on a small simulation study for testing i?o,G with G — {1, 2, . . . , 100}. 
We consider model (M2) from Section 5.1 with 4 different configurations and we use 
the p-value from (2.14) with corresponding decision rule for rejection of Hq q if the p- 
value is smaller or equal to the nominal level 0.05. Table 2 describes the result based 
on 500 independent simulations. The method works well with much better power than 
multiple testing of individual hypotheses but slightly worse than average power for testing 
individual hypotheses without multiplicity adjustment. This is largely in agreement with 
the theoretical results in Theorem 3. Furthermore, the type I error control is good, even 
for the model (M2),p=2500,s=15,b=l where the multiple testing error control performs 
poorly. 



30 P. Biihlmann 



model 


P [false rejection] 


P[true rejection] 


(power mult., power indiv.) 


(M2), p = 


500, s 


= 15, 6 = 0.5 


0.08 


0.68 


(0.14,1.00) 


(M2), p = 


500, s 


= 15, 6 = 1 


0.07 


1.00 


(0.58,1.00) 


(M2), p = 


2500, 


s = 15, 6 = 0.5 


0.01 


0.19 


(0.06,0.40) 


(M2), p = 


2500, 


s = 15, 6 = 1 


0.06 


0.99 


(0.30,0.87) 



Table 2. Testing of general hypothesis Hq q with |G| = 100 using the p-value in (2.14) with 
significance level 0.05. Second column: type I error; Third column: power; Fourth column: comparison 
with power using multiple testing and average power using individual testing without multiplicity 
adjustment (both for all p hypotheses Hqj {j = I, . . . ,p)). 



8.3. Number of false positives in simulated examples 



We show here the number of false positives V — Vq.qs in the simulated scenarios where 
the FWER was found too large. Except for model (M2),p=2500,s=15,b=l, the number 



model 


F[V = 0] 


r[v = 1] 


p[y = 2] 


F[V = 3] 


F[4<V < 8] 


p[y > 9] 


(Ml), p = 


2500,s 


= 15,6 = 1 


0.798 


0.186 


0.016 











(M2), p = 


500, s 


= 15, 6 = 0.5 


0.858 


0.132 


0.010 











(M2), p = 


500, s 


= 15, 6 = 1 


0.790 


0.186 


0.020 


0.004 








(M2), p = 


2500, . 


= 15, 6 = 0.5 


0.822 


0.170 


0.008 











(M2), p = 


2500, J 


= 15, b = 1 


0.076 


0.222 


0.268 


0.200 


0.234 






Table 3. Probabilities for false positives for simulation models from Section 5.1 in scenarios where the 
FWER is overshooting the nominal level 0.05. 



of false positives is small although the FWER is larger than 0.05. For the extreme model 
(M2),p=2500,s=15,b=l, which has a too large sparsity and a too strong signal strength, 
we would need to increase ^ in (2.12) to achieve better error control. 

8.4. Further discussion about p-values and bounds Aj in 
assumption (A) 

The p-values in (2.13) and (2.14) are crucially based on the idea of correction with the 
bounds Aj in Section 2.4.1. The essential idea is contained in Proposition 2: 

= an,p-j{(j){PyL)jj - an,p;j (o-) ^(Px)jfe(Anit;/c " PI) + a„,p;j(a-)Zj + negligible term. 
There are three cases. If 

an,p-M 5](^x)jfc(Anit;fe " 1^1) = Op(l), (8.3) 

a correction with the bound Aj would not be necessary, but of course, it does not hurt 
in terms of type I error control. If 

«n,p;,(a) ^(Px),fe(A„it;fe " PI) ^ V, (8.4) 
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for some non-degenerate random variable V, the correction with the bound Aj is neces- 
sary and assuming that is of the same order of magnitude as V, we have a balance 
between A^- and the stochastic term an,p-j{cr)Zj. In the last case where 

«n,py(o-)^(Px)j7c(Anit;fe " /3fc) OO, (8.5) 

the bound Aj would be the dominating clement in the p-value construction. We show in 
Figure 3 that there is empirical evidence that (8.4) applies most often. 



p=2500,sparsity=3,b=1 p=2500,sparsity=1 5,b=1 




I 1 1 1 1 I 1 1 1 1 

-4 -2 2 4 -4 -2 2 4 



Figure 3. Histogram of projection bias an,p;j (f) J^fc^z^j (^x)jfc (ftnit;fc ~ over all values j = 
l,...,p and over 100 independent simulation runs. Left; model (M2),p=2500,s=3,b=l; Right; model 
(M2),p=2500,s=15,b=l. 

Case (8.5) is comparable to a crude procedure which makes a hard decision about 
relevance of the underlying coefficients: 

if On.py (o')|/3coiT;j| > Aj holds, then Hqj is rejected, 

and the rejection would be "certain" corresponding to a p-value with value equal to 0; 
and in case of a "<" relation, the corresponding p-value would be set to one. This is an 
analogue to the thresholding rule: 



if IAmt;j| > Ainit holds, then Hqj is rejected, 



(8.6) 
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where Ajnit > ||/3init-/?°|loo, e.g., using a bound where Ai„it > HAnit-/?"!!!- For example, 
(8.6) could be the variable selection estimator with the thresholded Lasso procedure 
(van de Geer et al., 2011). An accurate construction of Ajnit for practical use is almost 
impossible: it depends on a and in a complicated way on the nature of the design through 
e.g. the compatibility constant, see (1.7). 

Our proposed bound A-,- in (2.12) is very simple. In principle, its justification also 
depends on a bound for ||/3init — P'^Wi, but with the advantage of "robustness". First, 
the bound a„^p; j (cr) max^^j |(^x)jfc||!Anit — appearing in (2.10) is not depending 
on a anymore (since ||Anit — Z^'^Hi scales linearly with a). Secondly, the inequality in 
(2.10) is crude implying that Aj in (2.12) may still satisfy assumption (A) even if the 
bound of IIAnit — /3°||i is misspecified and too small. The construction of p-values as in 
(2.13) and (2.14) is much better for practical purposes (and for simulated examples) than 
using a rule as in (8.6): case (8.5) seems unlikely and our procedure enjoys "robustness" 
properties. 



